Simple example which will:
- split the labelled ground truth data into cross-validation folds
- evaluate the performance of the first-pass (base Solr) search
- choose a /fcselect?rows setting for training the ranker
- train a ranker and evaluate its performance

To learn more about the data used in the experiment, see here: https://github.ibm.com/rchakravarti/rnr-debugging-scripts/tree/master/resources/insurance_lib_v2
Note: Ensure credentials have been updated in config/config.ini
In [1]:
import sys
from os import path, getcwd
import json
from tempfile import mkdtemp
import glob
sys.path.extend([path.abspath(path.join(getcwd(), path.pardir))])
from rnr_debug_helpers.utils.rnr_wrappers import RetrieveAndRankProxy, \
RankerProxy
from rnr_debug_helpers.utils.io_helpers import load_config, smart_file_open, \
RankerRelevanceFileQueryStream, initialize_query_stream, insert_modifier_in_filename, PredictionReader
from rnr_debug_helpers.create_cross_validation_splits import split_files_into_k_cv_folds
from rnr_debug_helpers.generate_rnr_feature_file import generate_rnr_features
from rnr_debug_helpers.compute_ranking_stats import compute_performance_stats
from rnr_debug_helpers.calculate_recall_at_varying_k_on_base_display_order import compute_recall_stats, \
print_recall_stats_to_csv
config_file_path = path.abspath(path.join(getcwd(), path.pardir, 'config', 'config.ini'))
print('Using config from {}'.format(config_file_path))
config = load_config(config_file_path=config_file_path)
insurance_lib_data_dir = path.abspath(path.join(getcwd(), path.pardir, 'resources', 'insurance_lib_v2'))
print('Using data from {}'.format(insurance_lib_data_dir))
Refer to the sample notebook 1.0 - Create RnR Cluster & Train Ranker for help setting up a cluster.
In [2]:
cluster_id = "sc40bbecbd_362a_4388_b61b_e3a90578d3b3"
collection_id = 'TestCollection'
bluemix_wrapper = RetrieveAndRankProxy(solr_cluster_id=cluster_id,
config=config)
if not bluemix_wrapper.collection_previously_created(collection_id):
    raise ValueError('Must specify one of the available collections: {}'.format(
        bluemix_wrapper.bluemix_connection.list_collections(cluster_id)))
The InsuranceLibV2 dataset actually provides separate training and validation splits, but for this demo we pretend we only have access to the 2,000-question dev subset of labelled ground truth data, so we have to split the data ourselves.
This data is included in this repository already formatted as a relevance file. However, if your ground truth was annotated using the RnR Web UI, you can export it and use the RnRToolingExportFileQueryStream to read the ground truth directly from the export-questions.json file that the Web UI lets you download.
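For example, reading such an export might look roughly like this (a minimal sketch: it assumes RnRToolingExportFileQueryStream lives in io_helpers alongside the other query streams used here, takes an open file handle like RankerRelevanceFileQueryStream does, and that export-questions.json sits in the working directory):
from rnr_debug_helpers.utils.io_helpers import smart_file_open, RnRToolingExportFileQueryStream

# Hypothetical sketch: adjust the path to wherever you downloaded the export
with smart_file_open('export-questions.json') as infile:
    web_ui_ground_truth = RnRToolingExportFileQueryStream(infile)
    # web_ui_ground_truth can then be passed anywhere a labelled query stream is
    # expected, e.g. to split_files_into_k_cv_folds in the next cell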
In [2]:
experimental_directory = mkdtemp()
number_of_folds = 3
with smart_file_open(path.join(insurance_lib_data_dir, 'validation_gt_relevance_file.csv')) as infile:
split_files_into_k_cv_folds(initialize_query_stream(infile, file_format='relevance_file'),
experimental_directory, k=number_of_folds)
print('\nCreated train and validation splits in directory: {}'.format(experimental_directory))
for filename in glob.glob('{}/*/*.csv'.format(experimental_directory), recursive=True):
print(filename)
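To sanity-check the output, you can peek at the first few lines of one of the generated splits (purely illustrative; Fold1/train.relevance_file.csv mirrors the paths used by the evaluation cells below):
# Print the first few rows of one generated split to see what the relevance file rows look like
sample_split = path.join(experimental_directory, 'Fold1', 'train.relevance_file.csv')
with smart_file_open(sample_split) as infile:
    for line_number, line in enumerate(infile):
        print(line.rstrip())
        if line_number >= 2:
            break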
The rows setting defines how many search results you wish to evaluate for relevance with respect to the ground truth annotations. As a starting point, you will likely want to set a large rows parameter so that you can observe recall at varying depths in the search results.
I primarily use NDCG as the evaluation metric (see here for why), but you can choose an alternative metric that makes sense for your application.
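As a quick refresher, NDCG@k compares the discounted cumulative gain of the predicted ordering against that of the ideal ordering of the same relevance labels. A minimal sketch (the repository's compute_performance_stats may use a slightly different gain or discount convention):
import math

def dcg_at_k(relevance_labels, k):
    # Common (2^rel - 1) / log2(rank + 1) formulation, with ranks starting at 1
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevance_labels[:k]))

def ndcg_at_k(relevance_labels, k):
    # Normalise by the DCG of the best possible ordering of the same labels
    ideal = dcg_at_k(sorted(relevance_labels, reverse=True), k)
    return dcg_at_k(relevance_labels, k) / ideal if ideal > 0 else 0.0

# e.g. a query whose best document (label 2) was ranked third:
print(ndcg_at_k([0, 1, 2, 0], k=4))  # < 1.0 because the ideal order is [2, 1, 0, 0]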
In [4]:
rows = 100
average_ndcg = 0.0
ndcg_evaluated_at = 50
for i in range(1, number_of_folds + 1):
test_set = path.join(experimental_directory, 'Fold%d' % i, 'validation.relevance_file.csv')
prediction_file = insert_modifier_in_filename(test_set,'fcselect_predictions','txt')
with smart_file_open(test_set) as infile:
# generate predictions
labelled_test_questions = RankerRelevanceFileQueryStream(infile)
json.dump(bluemix_wrapper.generate_fcselect_prediction_scores(
test_questions=labelled_test_questions, num_rows=rows,
prediction_file_location=prediction_file, collection_id=collection_id), sys.stdout, sort_keys=True, indent=4)
# score them
labelled_test_questions.reset()
with smart_file_open(prediction_file) as preds_file:
prediction_reader = PredictionReader(preds_file)
stats_for_fold, _ = compute_performance_stats(prediction_reader=prediction_reader,
ground_truth_query_stream=labelled_test_questions,
k=ndcg_evaluated_at)
print('\nPerformance on Fold %d' % i)
json.dump(stats_for_fold, sys.stdout, sort_keys=True, indent=4)
average_ndcg += stats_for_fold['ndcg@%d' % ndcg_evaluated_at]
average_ndcg /= number_of_folds
print('\nAverage NDCG@%d across folds: %.2f' % (ndcg_evaluated_at, average_ndcg))
At this point, you can experiment with tweaking the Solr schema and the document ingestion process, re-running the evaluation after each tweak to see its impact on the performance of your retrieval system. It's important to tweak one thing at a time so that you can assess each change's impact in isolation from other changes.
Broadly, you want to increase the likelihood of overlapping terms between the query and the correct answer documents, so the text analysis settings in your schema (tokenization, stemming, stop words, synonym expansion and so on) are a natural place to start.
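One lightweight way to keep those experiments comparable is to record the average NDCG observed after each tweak, for example (a small sketch that just reuses the average_ndcg and ndcg_evaluated_at values computed by the evaluation cell above):
# Simple bookkeeping so each schema / ingestion tweak stays comparable
experiment_log = {}

def record_experiment(tweak_description, avg_ndcg):
    experiment_log[tweak_description] = avg_ndcg
    for description, score in sorted(experiment_log.items(), key=lambda item: -item[1]):
        print('NDCG@%d %.4f  <-  %s' % (ndcg_evaluated_at, score, description))

# e.g. after re-running the evaluation loop above with a tweaked schema:
record_experiment('baseline fcselect (no tweaks)', average_ndcg)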
/fcselect?rows Setting for Ranker Training

Once the first-pass search from base Solr has been tweaked, the next step is to train a ranker that re-orders the rows of the search results.
As alluded to in the previous step, we can choose a reasonable rows setting by evaluating recall at varying depths in the search results.
TODO: Show plot instead of listing a csv
In [5]:
average_recall_over_folds = None
recall_settings = range(10, rows +1, 10)
for i in range(1, number_of_folds + 1):
print('\nComputing recall stats for fold %d' % i)
test_set = path.join(experimental_directory, 'Fold%d' % i, 'validation.relevance_file.csv')
prediction_file = insert_modifier_in_filename(test_set,'fcselect_predictions','txt')
with smart_file_open(test_set) as infile:
labelled_test_questions = RankerRelevanceFileQueryStream(infile)
with smart_file_open(prediction_file) as preds_file:
prediction_reader = PredictionReader(preds_file)
recall_stats = compute_recall_stats(recall_settings, labelled_test_questions, prediction_reader)
if average_recall_over_folds is None:
average_recall_over_folds = recall_stats
else:
for k in recall_stats.keys():
average_recall_over_folds[k] += recall_stats[k]
for k in average_recall_over_folds.keys():
average_recall_over_folds[k] /= float(number_of_folds)
print_recall_stats_to_csv(average_recall_over_folds, sys.stdout)
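Per the TODO above, the averaged numbers can also be plotted rather than printed as CSV (a small sketch, assuming average_recall_over_folds maps each evaluation depth from recall_settings to its average recall):
import matplotlib.pyplot as plt

depths = sorted(average_recall_over_folds.keys())
plt.plot(depths, [average_recall_over_folds[k] for k in depths], marker='o')
plt.xlabel('Number of rows retrieved (k)')
plt.ylabel('Average recall@k across folds')
plt.show()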
In [2]:
import matplotlib.pyplot as plt
plt.plot([10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
[0.10499188404402464, 0.14970515053390282, 0.1888193396855732, 0.22338266958456865, 0.24548006683189091,
0.26680334746718226, 0.2805234226981687, 0.2932795904934835, 0.3027138998653241, 0.3110476959302546])
plt.show()
Based on the above analysis, there is a sharp increase in recall until you reach a depth of ~80, at which point the gains start to level off (though ideally you might go higher to follow the trend). So, for now, we choose 90 as the rows setting for our ranker as a compromise between recall and the number of results that have to be retrieved and re-ranked for each query. You can, of course, simply try a bunch of rows settings and evaluate overall ranker performance with each setting too.
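One quick way to spot that elbow is to look at the marginal recall gained by each extra block of 10 rows; using the averaged values plotted above, the gain per step shrinks from roughly 0.045 at the start to under 0.01 beyond a depth of 80:
# Marginal recall gained by each additional 10 rows, from the plotted averages
recall_by_depth = [0.10499188404402464, 0.14970515053390282, 0.1888193396855732,
                   0.22338266958456865, 0.24548006683189091, 0.26680334746718226,
                   0.2805234226981687, 0.2932795904934835, 0.3027138998653241,
                   0.3110476959302546]
for depth, gain in zip(range(20, 101, 10),
                       (b - a for a, b in zip(recall_by_depth, recall_by_depth[1:]))):
    print('rows %3d -> +%.4f recall' % (depth, gain))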
WARNING: Each set of credentials gives you 8 rankers. Since I experiment a lot, I use a convenience flag (is_enabled_make_space below) to delete rankers in case the quota is full. You obviously want to switch this flag off if you have rankers you don't want deleted.
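If the quota does fill up and you'd rather clean up by hand, the rankers tied to your credentials can be listed (and deleted) through the Retrieve and Rank REST API; a rough sketch using requests, where the gateway URL and the placeholder credentials are assumptions you would replace with the values from config/config.ini:
import requests

# Placeholders: substitute the Retrieve and Rank service credentials from config/config.ini
rnr_auth = ('YOUR_RNR_USERNAME', 'YOUR_RNR_PASSWORD')
rnr_api = 'https://gateway.watsonplatform.net/retrieve-and-rank/api/v1'

# List the rankers currently associated with these credentials...
for ranker in requests.get('{}/rankers'.format(rnr_api), auth=rnr_auth).json()['rankers']:
    print(ranker['ranker_id'], ranker.get('name'))

# ...and, if needed, delete one you no longer want (irreversible):
# requests.delete('{}/rankers/{}'.format(rnr_api, 'RANKER_ID_TO_DELETE'), auth=rnr_auth)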
In [6]:
rows=90
average_ndcg = 0.0
ndcg_evaluated_at = 50
for i in range(1, number_of_folds + 1):
train_set = path.join(experimental_directory, 'Fold%d' % i, 'train.relevance_file.csv')
test_set = path.join(experimental_directory, 'Fold%d' % i, 'validation.relevance_file.csv')
# Step 1: Generate a feature file that can be used to train a ranker
with smart_file_open(train_set) as infile:
labelled_train_questions = RankerRelevanceFileQueryStream(infile)
feature_file = insert_modifier_in_filename(train_set,'fcselect_features','txt')
with smart_file_open(feature_file, mode='w') as outfile:
stats = generate_rnr_features(collection_id=collection_id, cluster_id=cluster_id, num_rows=rows,
in_query_stream=labelled_train_questions, outfile=outfile, config=config)
# Step 2: Train a ranker
ranker_api_wrapper = RankerProxy(config=config)
ranker_name = 'TestRanker'
ranker_id = ranker_api_wrapper.train_ranker(train_file_location=feature_file, train_file_has_answer_id=True,
is_enabled_make_space=True, ranker_name=ranker_name)
ranker_api_wrapper.wait_for_training_to_complete(ranker_id=ranker_id)
# Step 3: Generate predictions using the ranker id
with smart_file_open(test_set) as infile:
prediction_file = insert_modifier_in_filename(test_set,'fcselect_with_ranker_predictions','txt')
labelled_test_questions = RankerRelevanceFileQueryStream(infile)
json.dump(bluemix_wrapper.generate_fcselect_prediction_scores(
test_questions=labelled_test_questions, num_rows=rows, ranker_id=ranker_id,
prediction_file_location=prediction_file, collection_id=collection_id), sys.stdout, sort_keys=True, indent=4)
# Step 4: Evaluate
labelled_test_questions.reset()
with smart_file_open(prediction_file) as preds_file:
prediction_reader = PredictionReader(preds_file, file_has_confidence_scores=True)
stats_for_fold, _ = compute_performance_stats(prediction_reader=prediction_reader,
ground_truth_query_stream=labelled_test_questions,
k=ndcg_evaluated_at)
print('\nPerformance on Fold %d' % i)
json.dump(stats_for_fold, sys.stdout, sort_keys=True, indent=4)
average_ndcg += stats_for_fold['ndcg@%d' % ndcg_evaluated_at]
average_ndcg /= number_of_folds
print('\nAverage NDCG@%d across folds: %.2f' % (ndcg_evaluated_at, average_ndcg))
As we can see, both NDCG@50 and top-1 accuracy (aka Precision@1) go up dramatically with the supervised ranker. Refer here for troubleshooting tips with the ranker: https://developer.ibm.com/answers/questions/364292/solr-returns-the-ground-truth-relevant-results-but/